Seminar 06491: Digital Historical Corpora
Abstract
The seminar brought together scholars from (historical) linguistics, (historical) philology, computational linguistics and computer science who work with collections of historical texts. These texts, digital libraries or corpora are collected for a number of different purposes, such as lexicography, history, linguistics and philology. This naturally leads to different decisions in their design and architecture. However, many issues are common to projects working with historical texts. These include:

Standards and methods of digitization: Historical texts have to be digitized from different sources. Sometimes it is necessary to digitize directly from a manuscript or early print. In these cases it is not possible to use current OCR technology, and the texts have to be double-keyed (for example according to the standards developed at the Kompetenzzentrum Retrodigitalisierung in Trier). Newer texts can sometimes be scanned and OCRed, although even the relatively ‘clean’ 19th-century newspaper texts are often problematic. Fraktur and some other scripts (e.g. old Cyrillic scripts) also pose problems for OCR. For some research questions it is possible to work with editions. In these cases the digitization itself is not an issue (if the editions are new). It has to be decided, however, how to deal with a critical apparatus.

Design (composition) of corpora: While literary scholars often work on one text (or a small number of related texts), many research questions in linguistics and lexicography require a collection of several texts. Corpus design is always an issue in corpus construction. Ideally, a matrix of the necessary parameters (text type, author, time etc.) is constructed and all ‘cells’ are filled with the appropriate texts. For older time periods this is often not possible, since the texts might not have survived. A ‘skewed’ corpus permits only certain research questions.
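The design matrix described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name `empty_cells`, the parameter names and all values (periods, text types, text ids) are invented for this example and do not come from the seminar or from any particular corpus.

```python
from collections import defaultdict
from itertools import product


def empty_cells(texts, parameters):
    """Return every parameter combination ('cell') that no text fills.

    texts      -- list of dicts, each with an "id" and one value per parameter
    parameters -- dict mapping a parameter name to its list of possible values
    """
    names = list(parameters)
    filled = defaultdict(list)
    for text in texts:
        cell = tuple(text[name] for name in names)
        filled[cell].append(text["id"])
    all_cells = product(*(parameters[name] for name in names))
    return [cell for cell in all_cells if cell not in filled]


# Hypothetical design parameters and texts, invented for illustration.
parameters = {
    "period": ["1650-1700", "1700-1750"],
    "text_type": ["sermon", "newspaper"],
}
texts = [
    {"id": "t1", "period": "1650-1700", "text_type": "sermon"},
    {"id": "t2", "period": "1700-1750", "text_type": "sermon"},
    {"id": "t3", "period": "1700-1750", "text_type": "newspaper"},
]

missing = empty_cells(texts, parameters)
print(missing)
```

A non-empty result marks the places where the corpus is ‘skewed’, which in turn limits the research questions it can support.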
Standards and methods of annotation: For many research questions it is not sufficient to have the ‘naked’ text; the texts need to be annotated with further information. They need (a) header annotation (information about the whole text), (b) positional annotation (annotation for each token), and (c) structural annotation. The Text Encoding Initiative and other groups have developed suggestions for historical texts (the most detailed suggestions pertain to the header annotation). Annotation often cannot be done automatically, since older texts are less standardized than newer texts, which makes it difficult to develop statistical or rule-based methods. It is necessary to discuss possible automation and to develop good annotation tools for manual or semi-automatic annotation.

Corpus architecture: Most large modern corpora are stored in some table or tree format. Such architectures might not be the best option for historical corpora, since they cannot accommodate conflicting annotation. Therefore one has to think about alternatives such as multi-layer models or database models.
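To make the point about conflicting annotation concrete, the following is a minimal sketch of one possible multi-layer (stand-off) representation. It is not the model proposed at the seminar or by the TEI. The class names `Span` and `Document`, the layer names and the sample tokens are all assumptions made for this example. Header annotation is a dictionary, positional annotation is a layer with one span per token, and structural annotation is a layer of longer spans. Each layer is stored independently, so spans from different layers may cross, which a single tree cannot express.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str


@dataclass
class Document:
    header: dict                                 # information about the whole text
    tokens: list                                 # the 'naked' text, tokenized
    layers: dict = field(default_factory=dict)   # layer name -> list of Span

    def annotate(self, layer, start, end, label):
        """Add a span to a layer; layers do not constrain each other."""
        if not 0 <= start < end <= len(self.tokens):
            raise ValueError("span outside token range")
        self.layers.setdefault(layer, []).append(Span(start, end, label))

    def crossing(self, layer_a, layer_b):
        """Return span pairs that overlap without one containing the other."""
        pairs = []
        for a in self.layers.get(layer_a, []):
            for b in self.layers.get(layer_b, []):
                overlap = a.start < b.end and b.start < a.end
                nested = (a.start <= b.start and b.end <= a.end) or (
                    b.start <= a.start and a.end <= b.end
                )
                if overlap and not nested:
                    pairs.append((a, b))
        return pairs


# Invented sample: six placeholder tokens, two manuscript lines, one sentence.
doc = Document(header={"title": "sample text"},
               tokens=["w1", "w2", "w3", "w4", "w5", "w6"])
for i in range(len(doc.tokens)):
    doc.annotate("pos", i, i + 1, "X")           # positional: one span per token
doc.annotate("line", 0, 4, "line 1")             # structural: physical lines
doc.annotate("line", 4, 6, "line 2")
doc.annotate("sentence", 2, 6, "s1")             # structural: runs across the line break

conflicts = doc.crossing("line", "sentence")
print(conflicts)
```

Here the sentence starts inside the first manuscript line and ends in the second, so the two structural layers cannot both be nested in one tree; keeping them as separate layers avoids having to choose between them.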
Similar resources
Guideline: Multiple Hierarchies
As the title of the Dagstuhl Seminar Digital Historical Corpora Architecture, Annotation, and Retrieval already suggests, corpus architecture and corpus annotation are important topics for representing (historical) texts. In particular, the limitation of SGML-based markup languages to tree-structured annotations raises special problems when dealing with manuscripts: How is it possible to represe...
GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora
For the 'Digital Historical Corpora Architecture, Annotation, and Retrieval' conference, 03-08 December 2006, Dagstuhl (D). GerManC - Towards a Methodology for Constructing and Annotating Historical Corpora. Astrid Ensslin, Martin Durrell, Paul Bennett, University of Manchester (UK). Our paper focuses on the one hand on the challenges posed by the structural variability, flexibility and ambiguity found in...
Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene
The paper describes a tool developed to process historical (Slovene) text, which annotates words in a TEI encoded corpus with their modern-day equivalents, morphosyntactic tags and lemmas. Such a tool is useful for developing historical corpora of highly-inflecting languages, enabling full text search in digital libraries of historical texts, for modernising such texts for today's readers and m...
Information Retrieval from Historical Corpora
With the increasing number of documents that are available in digital form, the number of digital historical documents is also increasing (Berkvens, 2001). It cannot be assumed that standard IR systems perform well on historical documents: historical texts differ from modern texts in three ways (Hüning, 1996; Van Der Horst and Marschall, 1989): (a) vocabularies have changed, (b) spelling has ch...
Tagging Historical Corpora - the problem of spelling variation
Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes ...
Publication date: 2007